[Enh]: Add `Expr.map_batches` to pyspark by pedro-villanueva-bcom · Pull Request #3579 · narwhals-dev/narwhals

pedro-villanueva-bcom · 2026-04-28T11:12:17Z

Description

Expr.map _batches can be used when native expressions aren't enough, for example for statistical functions. Pyspark has several types of UDFs, including pandas UDFS that matches very well with map _batches. This PR implements map _batches using pandas UDFs. The optional param returns_scalar is not supported, as pyspark doesn't allow this. UDF must return either a pandas Series, something that can be transformed into one, or a scalar that will be broadcast to one.

The only change external to the spark backend is in the kind of the map_batches node, which has been changes from ordered to unordered.
Additionally, the testing fixture that creates the spark session now add the PYSPARK_PYTHON env var so that UDFs use that python to run (including using whatever packages are installed).

What type of PR is this? (check all applicable)

Related issues

Related issue #<issue number>
Closes [Enh]: Add map_batches to pyspark #3578

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

pedro-villanueva-bcom · 2026-04-28T11:28:28Z

~~I messed-up the commits, sorry~~ fixed

pedro-villanueva-bcom · 2026-04-28T11:56:29Z

I don't understand this test failure: https://github.com/narwhals-dev/narwhals/actions/runs/25050142610/job/73375481418?pr=3579
It runs fine on my local machine. The error says pyspark.errors.exceptions.base.PySparkTypeError: [NOT_COLUMN_OR_STR] Argument col should be a Column or str, got Column., which is really weird.

This test coverage test is also strange: https://github.com/narwhals-dev/narwhals/actions/runs/25050142614/job/73375481310?pr=3579
It doesn't run for pyspark, so the coverage is below 100% because of the new code, should I add pyspark to that test?

FBruzzesi · 2026-04-28T14:50:30Z

Hey @pedro-villanueva-bcom - thanks for taking the initiative! I am not sure we should support map_batches for lazy backends - I am open to see how this will play out!

Regarding your questions:

I don't understand this test failure: narwhals-dev/narwhals/actions/runs/25050142610/job/73375481418?pr=3579 It runs fine on my local machine. The error says pyspark.errors.exceptions.base.PySparkTypeError: [NOT_COLUMN_OR_STR] Argument col should be a Column or str, got Column., which is really weird.

I am not sure, but it would not be the first time that something is passing for pyspark but not for pyspark-connect.
It's ok to explicitly fail for spark-connect if you are not able to replicate and fix the issue.

This test coverage test is also strange: narwhals-dev/narwhals/actions/runs/25050142614/job/73375481310?pr=3579 It doesn't run for pyspark, so the coverage is below 100% because of the new code, should I add pyspark to that test?

Coverage is calculated with SQLFrame backend, so you will need to add a # pragma: no cover for the entire method if SQLFrame is not supported.

pedro-villanueva-bcom · 2026-04-28T15:18:03Z

I am not sure we should support map_batches for lazy backends - I am open to see how this will play out!

Any specific reason for this? In my mind (and use case) udfs are just another type of expression to create a column. It has performance implications for sure, but in my case there's no other choice (that's mostly statistical functions like getting a p-value from a column with a z score for example).
Additionally, I wanted to look into how to support aggregation udfs. I see that polars has a map_group function. I also use the pyspark equivalent for, again, statistical summaries that can't be calculated in other ways (this case is a little less frequent though).

My use case is a library of statistical functions for large datasets that works for pyspark, pyspark connect and snowpark. I want to make it work for in-memory backends too to deal with small data and make testing faster too. I discovered narwhals and I'm quite happy with it. The syntax is nice (nicer than ibis) and migrating is not super hard.
Let me know if you want to talk about this more, happy to do it

for more information, see https://pre-commit.ci

MarcoGorelli · 2026-05-01T13:43:33Z

thanks for the PR!

looking at the docs

Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data, which allows pandas operations

I'm also inclined to decline this feature i'm afraid, as it looks like a massive performance footgun

Which statistical summaries are you looking to do?

In any case, I think I'd suggest making a helper function within your code for this, I'm extremely hesitant to use anything from the pandas api in pyspark

pedro-villanueva-bcom · 2026-05-01T14:03:01Z

Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data, which allows pandas operations

I'm also inclined to decline this feature i'm afraid, as it looks like a massive performance footgun

Which statistical summaries are you looking to do?
In any case, I think I'd suggest making a helper function within your code for this, I'm extremely hesitant to use anything from the pandas api in pyspark

They are less performant that native expressions, but there's a reason pyspark (and snowpark, and all other backends I know of) allow for udfs: sometimes that's the only way to get something done. I think that the end user has to make the choices and tradeoffs, not the library. We can emit a warning like when using the pandas backend apply has to be called with a complex aggregation.

Example udf:

@udf(packages=["scipy"])
def norm_p_value(values):  # type: ignore[no-untyped-def]
    """Normal distribution survival function (vectorized)."""
    import scipy.stats as stats
    if value is None:
        return None

    result = stats.norm.sf(value)
    return result.item() if np.ndim(result) == 0 else result

udf is a custom decorator that distributes the dependencies to the right backend (snowpark needs to know that the udf will use scipy), and because map_batches is not available, creates a pyspark udf with the right types (so it essentially reproduces what this PR does).

Then I have to use it like

tmp_input_col = f"__{col}_abs"
df = df.with_columns(nw.col(z_col).abs().alias(tmp_input_col))
df = norm_p_value(df, tmp_input_col, output_col)f.drop(tmp_input_col)

Instead of simply

df = df.select(nw.col("col").abs().map_batches(norm_p_value))

which then I can keep chaining.

Not having map_batches is not a deal breaker, as I can recreate it, but it makes things harder and narwhals less useful as a common layer.

pedro-villanueva-bcom force-pushed the add_pyspark_map_batches branch from c95fa6f to e5b315f Compare April 28, 2026 11:25

pedro-villanueva-bcom force-pushed the add_pyspark_map_batches branch 2 times, most recently from 491bdd7 to 0d622c0 Compare April 28, 2026 14:30

pedro-villanueva-bcom marked this pull request as draft April 28, 2026 14:31

pedro-villanueva-bcom marked this pull request as ready for review April 29, 2026 10:09

pedro-villanueva-bcom changed the title ~~[Enh]: Add Expr.map _batches to pyspark~~ [Enh]: Add Expr.map_batches to pyspark Apr 29, 2026

pedro-villanueva-bcom and others added 11 commits April 30, 2026 10:34

use a pandas_udf to implement map_batches for pyspark

58376ae

map_batches doesn't require ordering

e659470

test for pyspark

d59f152

add env var so pyspark udfs can use installed packages

e482400

[pre-commit.ci] auto fixes from pre-commit.com hooks

ab602cd

for more information, see https://pre-commit.ci

test only valid for pyspawrk

a1ee15a

don't import pandas directly

e489de2

coverage only checks sqlframe

a123a85

don't run tests to pyspark[connect]

fe6f0ff

remove print

c0f7457

parametrize test

00eaa83

pedro-villanueva-bcom force-pushed the add_pyspark_map_batches branch from 1572e5f to 00eaa83 Compare April 30, 2026 08:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Enh]: Add `Expr.map_batches` to pyspark#3579

[Enh]: Add `Expr.map_batches` to pyspark#3579
pedro-villanueva-bcom wants to merge 11 commits intonarwhals-dev:mainfrom
pedro-villanueva-bcom:add_pyspark_map_batches

pedro-villanueva-bcom commented Apr 28, 2026

Uh oh!

pedro-villanueva-bcom commented Apr 28, 2026 •

edited

Loading

Uh oh!

pedro-villanueva-bcom commented Apr 28, 2026

Uh oh!

FBruzzesi commented Apr 28, 2026

Uh oh!

pedro-villanueva-bcom commented Apr 28, 2026 •

edited

Loading

Uh oh!

MarcoGorelli commented May 1, 2026

Uh oh!

pedro-villanueva-bcom commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pedro-villanueva-bcom commented Apr 28, 2026

Description

What type of PR is this? (check all applicable)

Related issues

Checklist

Uh oh!

pedro-villanueva-bcom commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pedro-villanueva-bcom commented Apr 28, 2026

Uh oh!

FBruzzesi commented Apr 28, 2026

Uh oh!

pedro-villanueva-bcom commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MarcoGorelli commented May 1, 2026

Uh oh!

pedro-villanueva-bcom commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pedro-villanueva-bcom commented Apr 28, 2026 •

edited

Loading

pedro-villanueva-bcom commented Apr 28, 2026 •

edited

Loading